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Abstract 

Background: Identification of cooperative gene regulatory network is an important topic for biological study 
especially in cancer research. Traditional approaches suffer from large noise in gene expression data and false 
positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory 
network. Methods that are capable of revealing underlying modularized structure and robust to noise and false 
positives are needed to be developed. 

Results: We proposed and developed an integrated approach to identify gene regulatory networks, which consists 
of a novel clustering method (namely motif-guided affinity propagation clustering (mAPC)) and a sampling based 
method (called Gibbs sampler based on outlier sum statistic (GibbsOS)). mAPC is used in the first step to obtain 
co-regulated gene modules by clustering genes with a similarity measurement taking into account both gene 
expression data and binding motif information. This clustering method can reduce the noise effect from microarray 
data to obtain modularized gene clusters. However, due to many false positives in motif binding data, some genes 
not regulated by certain transcription factors (TPs) will be falsely clustered with true target genes. To overcome this 
problem, GibbsOS is applied in the second step to refine each cluster for the identification of true target genes. In 
order to evaluate the performance of the proposed method, we generated simulation data under different signal- 
to-noise ratios and false positive ratios to test the method. The experimental results show an improved accuracy in 
terms of clustering and transcription factor identification. Moreover, an improved performance is demonstrated in 
target gene identification as compared with GibbsOS. Finally, we applied the proposed method to two breast 
cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of 
breast cancer, as supported by their functional annotations. 

Conclusions: We have developed a two-step approach for gene regulatory network identification, featuring an 
integrated method to identify modularized regulatory structures and refine their target genes subsequently. 
Simulation studies have shown the robustness of the method against noise in gene expression data and false 
positives in motif binding data. The proposed method has been applied to two breast cancer gene expression 
datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the 
method in identifying key regulatory networks related to the progression and recurrence of breast cancer. 
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Background 

Living cells must be able to correctly respond to internal 
and external stimuli by adjusting gene expression levels 
[1]. Transcription factors (TFs) cooperatively regulate 
genes in forming gene regulatory networks, which plays 
a crucial role in the gene regulation process. Recently, 
biological researchers have shown that some diseases 
like cancer are closely related to the breakdown of regu- 
latory networks, and many oncogenes (i.e., genes closely 
related to cancer) have been shown enrichment in this 
regulation mechanism [2]. Thus identification of tran- 
scriptional gene regulatory networks becomes a promis- 
ing direction in the field of biology and bioinformatics. 
Several statistical methods such as principle component 
analysis (PCA) [3] and independent component analysis 
(ICA) [4] are developed to discover the underlying regu- 
lation mechanism. However, the strong assumption of 
independent or uncorrected components cannot be 
easily satisfied in many real biological applications. Due 
to the fact that genes tend to cooperate to take effect, 
identifying co-expressed genes modules is an intuitive 
way to reconstruct regulatory networks. Therefore some 
clustering based methods such as Fuzzy C-means clus- 
tering [5] have been developed to discover co-expressed 
genes modules. However co-expressed gene modules are 
different from co-regulated genes in which we are inter- 
ested. Co-regulated genes are regulated by some com- 
mon TFs and tend to have similar gene expression 
pattern. On the contrary, co-expressed genes are not 
necessarily regulated by common TFs [6]. Moreover, 
these methods fail to incorporate the motif binding 
information provided by matching DNA upstream 
sequences and TFs with whole genome sequencing tech- 
niques [1]. 

Dynamic Bayesian Network [7] is one of the integra- 
tive methods, and it takes the motif-binding information 
as prior knowledge and learns the network from gene 
expression data. But the method will be hard to analyze 
data with large candidate TF pool, which limits its appli- 
cation to real biological studies. Network component 
analysis (NCA) [8] and several NCA-based methods 
such as FastNCA [9] are among several successful inte- 
grative methods, which are specifically developed to 
interpret gene regulatory network as a bipartite network. 
With some reasonable assumptions referred to as NCA 
criteria [8], NCA can decompose gene expression data 
to estimate the TF activity and then further infer the 
regulation strength. Nevertheless, motif binding data are 
often contaminated with many false positive connections 
and NCA is very sensitive to those false connections. To 
address the problem of false positive connections, Gu 
et al. have developed a regression based Gibbs sampling 
method (namely GibbsOS [10]) to discover true target 



genes from an initial gene pool. GibbsOS employs the 
same model as NCA does and summarizes regression 
t-test statistics into an outlier sum statistic [11], then 
with the help of Gibbs sampling strategy [12], it can 
identify true target genes from the gene pool. However, 
it fails to take modularized regulatory structure into 
consideration; therefore GibbsOS will perform poorly 
when a large number of TF candidates are investigated, 
which significantly limits its application to real biological 
studies. 

The limitations of current methods can be summar- 
ized as follows: (i) being sensitive to contaminations (e. 
g., noise and false positives) in genomic data, (ii) failing 
to identify the modularized structure and (iii) being 
unable to handle a large number of candidate TFs. In 
this paper, we aim at tackling the above-mentioned lim- 
itations by proposing a novel method that combines a 
clustering method with GibbsOS to discover the hidden 
regulation mechanism; the clustering method is called 
motif-guided affinity propagation clustering (mAPC) [2], 
a modified version of affinity propagation clustering 
(APC) [13]. To evaluate the performance, we generate 
some synthetic data under different signal-to-noise 
ratios (SNRs) and numbers of false positive connections, 
with which to show that our method has an improved 
performance in regulatory network identification. 
Besides, two breast cancer patient datasets are used to 
demonstrate the feasibility of the proposed method for 
real biological studies. Experimental results show that 
the proposed method is able to identify active TFs and 
their target genes, hence, to reconstruct the underlying 
regulatory network. 

Results and discussion 

Motif-guided affinity propagation clustering and Gibbs 
sampler based on outlier sum statistics 

The flowchart of the proposed two-step method is 
shown in Figure 1. In the first step, mAPC is applied to 
identify the modularized structure by clustering genes 
into co-regulated modules. Unlike traditional clustering 
methods, mAPC uses both gene expression data and 
motif-binding data to measure the similarity between 
genes, which can reduce the noise effect from microar- 
ray gene expression data for more reliable clustering 
results. Besides gene clustering, TF identification is also 
applied within each cluster in the first step, in order to 
select the closely related TFs for further investigation. 

In the second step, we apply GibbsOS to each cluster 
to remove false positive connections for target gene 
identification. For the convenience of explanation, we 
define true target genes as "foreground" genes and genes 
not regulated by TFs as "background" genes; in such a 
way, GibbsOS can be seen as identifying foreground 
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genes from the entire gene pool. The detailed descrip- 
tion of the method is summarized in the Methods 
section with mathematical details outlined. 

Simulation experiments 

The simulation data are synthesized by MATLAB func- 
tions with 300 genes (which include 100 foreground 
genes and 200 background genes), 80 TFs and 20 
experiments (or samples). The motif binding data are 
generated with modularized structures for both fore- 
ground genes and background genes, and the TF activ- 
ities are randomly generated with Gaussian random 
variables of mean 0 and variance 1. Then the foreground 
gene expression data can be synthesized by a linear 
combination of motif-binding data and TF activities 
using a log-linear model provided by Liao et al. [8] . For 
the background genes, the gene expression data are ran- 
domly generated with Gaussian random variables (of 
mean 0 and variance 1) and the amplitude is modified 
to ensure the equal variance between foreground and 
background gene expression patterns. To perturb the 
data, noise is randomly added to gene expression data 
with certain signal-to-noise ratio (SNR). The level of 
false positives (FPs) added in motif binding data is 
measured by FP ratio, which is defined as the number 
of false positive connections over the number of true 



positive connections within foreground genes. To test 
the performance of the proposed method against noise 
in gene expression and false positives in motif binding 
data, we first fix the SNR level at 5 dB, and then test 
the performances of mAPC clustering and TF identifica- 
tion under three different FP ratios (0.5, 1.0 and 1.5). 
Further, we fix the FP ratio at 1.0 and generate simula- 
tion data under three SNR levels (0 dB, 5 dB and 10 dB) 
to assess the effect of false positives on the performance 
of mAPC-GibbsOS. 

The performance evaluation is done systematically in 
terms of modularized structure reconstruction and tar- 
get gene identification. Firstly, we use the simulation 
data to assess the performance of our clustering method, 
i.e., mAPC. A partition evaluation method, namely 
adjusted rand index (ARI) [14], is used here to compare 
the clustering results with the ground truth of the simu- 
lation data to assess the clustering accuracy (see Meth- 
ods for details). With any two clusters, ARI can be 
calculated and summarized into a value between -1 and 
1 and a higher value of ARI represents more similarity 
between the clusters. If the ARI value is 1, it means that 
the two clusters in comparison are exactly the same. 
Besides mAPC, three classical methods (which are k- 
means clustering, hierarchical clustering and APC) are 
used to compare and show the disadvantage of lacking 
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Table 1 Adjusted rand index values for clustering 
evaluation. 



SNR(dB) 


FP ratio 


mAPC 


APC 


Hierarchical 
clustering 


k-means clustering 


10 


1 


0.6363 


0.4039 


0.3538 


0.3504 




0.5 


0.7026 








5 


1 

1.5 


0.5135 
0.3067 


0.2238 


0.0495 


0.1380 


0 


1 


0.4037 


0.1088 


0.0098 


0.0921 



motif binding information. Table 1 shows the ARI cal- 
culated under different SNRs and FP ratios. It can be 
seen that when SNR decreases from 10 dB to 0 dB, the 
performances of three competing methods tend to drop 
greatly. In contrast, with the support from motif binding 
information, mAPC can still maintain a good perfor- 
mance in clustering under low SNR cases. We can also 
see that mAPC cannot gain much improvement under 
high FP ratios, because when FP ratio is very large, 
motif binding information cannot provide strong sup- 
port; thus for cases with high FP ratios, the trade-off 
parameter X (see Equation (4) in Methods) needs to be 
tuned to emphasize gene expression data more than 
motif binding information. 

Besides the clustering performance, the performance 
of TF identification (see Methods for details) is also 
needed to be evaluated; we use the area under the recei- 
ver operating characteristic (ROC) curve (AUC) as a cri- 
terion for evaluation. Table 2 shows the AUC values 
calculated under different simulation conditions. Con- 
sidering Table 1 and Table 2 together, we can under- 
stand that the performance of TF identification is 
closely related to the accuracy of clustering, which 
implicitly underlines the importance of taking modular- 
ized structure into consideration. Under different experi- 
mental conditions, the performance of TF identification 
is excellent with AUC values above 0.83; it supports that 
the modularized structure can be robustly reconstructed 
by mAPC-GibbsOS. 

After evaluating the performance of clustering and TF 
identification (in revealing modularized structures), we 
need to further test the performance of target gene 
identification. Similar to mAPC performance evaluation. 



Table 2 AUC values for mAPC-based TF identification. 



SNR(dB) 


FP ratio 


Clusterl 


mAPC 

Cluster2 


10 


1 


0.9844 


0.9838 


5 


0.5 
1 

1.5 


0.9650 
0.9225 
0.8369 


0.9656 
0.9225 
0.8313 


0 


1 


0.8850 


0.8850 



*Cluster 1 and 2 are not the same cluster under different conditions 



Table 3 AUC values of target gene identification of 
mAPC-GibbsOS vs. GibbsOS. 



SNR(dB) 


FP ratio 


mAPC-GibbsOS 


GibbsOS 






Clusterl 


Cluster2 




10 


1 


0.8437 


0.8486 


0.6568 




0.5 


0.8296 


0.8021 


0.7099 


5 


1 


0.7625 


0.8417 


0.6290 




1.5 


0.7679 


0.7397 


0.5952 


0 


1 


0.7203 


0.8243 


0.6102 



^Cluster 1 and 2 are not the same cluster under different conditions 



both gene expression noise and false positive connec- 
tions are taken into consideration to investigate their 
effects on the performance. The experimental results are 
also summarized by AUC and the values are shown in 
Table 3. Under the same FP ratio at 1.0, our proposed 
method has an AUC improvement of 0.15 in average for 
each SNR level. From another point of view, when SNR 
is fixed at 5 dB, mAPC-GibbsOS can achieve an average 
improvement of 0.1 under different FP ratios. 

It is also worth noting that the small sample size of gene 
expression data is a common problem for robust identifi- 
cation of gene regulatory networks. In order to study the 
impact of the small sample size onto the performance of 
our proposed method, we have generated gene expression 
data and motif-binding data under 5 different sample sizes 
(with 5, 15, 25, 35 and 45 samples) with 300 genes and 80 
TFs. The noise is generated under standard normal distri- 
bution with a frxed SNR at 5 dB. FP ratio for motif-binding 
data is set to 0.5. Then AUC values of both mAPC-Gibb- 
sOS and GibbsOS are calculated 5 times for each condi- 
tion. The experimental results are shown in Figure 2 and 
Table 4. It can be seen from Figure 2 that the small sample 
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Table 4 AUC values of target gene identification under 
different sample sizes. 

Sample Size 
5 15 25 35 45 

Max 0.7077 0.8421 0.8704 0.9724 0.9742 

Clusterl Median 0.6886 0.8090 0.8381 0.9435 0.9732 

Min 0.6333 0.7765 0.7937 0.9268 0.9492 

mAPC- 

GibbsOS 

Max 0.7456 0.8563 0.8497 0.9678 0.9769 

Cluster2 Median 0.7082 0.8231 0.8338 0.9554 0.9534 

Min 0.6549 0.7953 0.7957 0.9383 0.9460 

Max 0.5223 0.6405 0.8704 0.9687 0.9742 

GibbsOS Median 04933 0.6200 0.8497 0.9554 0.9732 

Min 04313 0.5759 0.7957 0.9268 0.9534 
*Cluster 1 and 2 are not the same cluster under different conditions 

size does impact the performance of the methods; in parti- 
cular, when the number of samples is less than 25, the per- 
formance degrades for both mAPC-GibbsOS and 
GibbsOS. However, the performance degradation of 
mAPC-GibbsOS is much less than that of GibbsOS; as a 
matter of fact, mAPC-GibbsOS outperforms GibbsOS 
with an increase AUC of 0.18 to 0.2, when the sample size 
is smaller than 25. To our understanding, that mAPC- 
GibbsOS suffers less from the small sample size problem 
than GibbsOS may largely be a result of its consideration 
of the underlying modularized structure. By considering 
the underlying modularized structure, the proposed 
method is capable of dividing the data into two or more 
modules with reduced data dimension by gene clustering 
and TF identification. In our case, the data is divided into 
two modularized parts for the method to identify target 
genes; as a result, the target gene identification using 
mAPC-GibbsOS can achieve an AUC value over 0.7, 
which is 0.2 higher than that of GibbsOS. 

Breast cancer microarray data 

Our method is further tested upon two estrogen receptor 
(ER) related breast cancer patient datasets mentioned in 



Symmans et al. [15] and Loi et al. [16] to identify gene 
regulatory networks. The patient samples in the two 
datasets are divided into 'early recurrence' group 
(< 3 years) and 'late recurrence' group (> 6 years) accord- 
ing to survival time. The Symmans et al. dataset [15] 
consists of 21 samples in 'early recurrence' group and 41 
samples in 'late recurrence' group, and the Loi et al. data- 
set [16] has 49 samples in 'early recurrence' group and 76 
samples in 'late recurrence' group. An initial gene set is 
selected by T-test on gene expression data between 'early 
recurrence' and 'late recurrence' groups with a threshold 
p-value of 0.05. In this study, we analyze the up-regulated 
genes (over-expressed in 'early recurrence' group) and 
down-regulated genes (over-expressed in 'late recurrence' 
group) separately. For Symmans et al. data [15], totally 
615 up-regulated genes and 344 down-regulated genes 
are selected, while there are 668 up-regulated genes and 
559 down-regulated genes selected for Loi et al. data 
[16]. Motifs are selected from ER related signaling path- 
ways and binding sites [17], which are believed to have 
strong connections with cancer progression. Finally 88 
and 84 motifs are chosen for Symmans et al. data [15] 
and Loi et al. data [16] respectively. 

Applying the proposed mAPC-GibbsOS method to the 
two breast cancer datasets, we can find consistent 
results in the TF layer. The analysis is done for up-regu- 
lated genes and down-regulated genes separately; for 
each case, we identify two clusters of genes and then 
motif enrichment is conducted within each cluster. In 
Figure 3, the Venn diagrams show that totally 66 up- 
regulated motifs and 48 down-regulated motifs are iden- 
tified from Loi et al. data [16], and 59 up-regulated 
motifs and 44 down-regulated motifs are identified from 
Symmans et al. data [15]. About 65% of the motifs are 
shared among these two datasets; this large overlap indi- 
cates that the similarity between the regulation mechan- 
isms enriched in these two datasets is high. 
Furthermore, in order to further study the overlap 
motifs of the two datasets, we have matched the 
enriched motifs to corresponding TFs and incorporated 
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the protein-protein interaction information from HPRD 
database [18]. Then we can construct TF networks with 
protein-protein interactions under each case to help 
study the regulation mechanism. Figure 4(a) and 4(b) 
show the TF networks connected by protein-protein 
interaction in terms of up- and down-regulated net- 
worlcs, respectively. It can be seen that there is a large 
overlap between the two networks and the common 
genes such as JUN, STAT5A and SPl have already been 
shown the involvements in cancer progression. JUN and 
CREB play an important role in MAPK signaling path- 
way and MAPK pathway has been shown as the center 



of some signaling networks controlling the growth, pro- 
liferation and differentiation of many cell types [19]. 
Furthermore, JUN is closely related to drug resistance 
[20] and CREB is a cyclic-AMP response element-bind- 
ing protein and its over-expression or activation is fre- 
quently observed in breast cancer tissues [21]. Besides, 
CEBPA/B TFs have also been shown a strong correla- 
tion with cancer cell growth and differentiation [22]. 
STAT family TFs such as STAT5 play a central role in 
Jak-STAT signaling pathway whose importance has 
already been stressed in [23]. The involvement of 
POU2F1 in hormonal signals [24] can also lead to 




C . D 

'Early recurrence' 'Late recurrence ' Early recurrence 'Late recurrence' 




Figure 4 Identified TF networks (as connected by protein-protein interactions): (A) up-regulated TFs, (B) down-regulated TFs, and 

expression patterns of target genes of selected TFs on: (C) Symmans ef a!, data [15] and (D) Loi ef al. data [16]. 
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cancer progression, and SPl is demonstrated to be one 
of the TPs that either enhance or repress the activity of 
promoters of genes involved in differentiation, cell cycle 
progression and oncogenesis [25]. ETSl and ELKl are 
members of ETS family genes that are often over- 
expressed in breast cancer [26]. In Figure 4(c) and 4(d), 
the gene expression patterns for identified target genes 
of the overlapped TPs in (a) and (b) are shown on Sym- 
mans et al. data [15] and Loi et al. data [16] respec- 
tively. We can see that the target genes show both up- 
regulated and down-regulated patterns, because corre- 
sponding TPs involve in both regulation directions. If 
considering the up-regulation and down-regulation 
separately, we can observe consistent expression pat- 
terns, which further support the involvement of identi- 
fied TPs in regulation. 

To further demonstrate the robust performance of 
mAPC-GibbsOS, a bootstrapping procedure is used in 
this breast cancer study to provide the confidence level 
of the identified TPs. The rationale behind this approach 
is that the TPs with greater frequency included in the 
identified networks should be more confident. The non- 
parametric bootstrapping process generates different 
datasets many times by re-sampling the experimental 
samples with replacement. Our proposed method, 
mAPC-GibbsOS, is applied on each newly generated 
data, and the times are counted for TPs identified with 
a certain confidence level (which is set to 0.9 in this 
study). Totally 100 bootstrap versions of the gene 
expression data are used in this bootstrapping analysis; 
the results are summarized in terms of confidence score, 
which is calculated as the frequency of TPs (shown in 
Pigure 5). As seen from Pigure 5, all the identified TPs 
have a confidence score greater than 0.3 and most of 
them appear more than 50 times. It can also be seen 
that some TPs such as CEBPA, ETSl, JUN, SPl and 
STAT family TPs are identified confidently by mAPC- 
GibbsOS in both up-regulated and down-regulated net- 
works. Furthermore, these two regulation networks also 
have some specific TPs like ATP4, CEBPB, ELKl, MYB 
and POU2P1 in up-regulated network and MYC in 
down-regulated network. We have further converted the 
p-value (as calculated by Equation (5) in the Methods 
section) from TP identification to a score, which is the 
inverse cumulative distribution function (ICDP) of the 
corresponding p-value under standard normal distribu- 
tion. The upper limit of the score is set to 4 to remove 
some extreme values calculated by the ICDP. Pigure 6 
shows the variation of the score in terms of boxplot. 
Most TPs shown have a median value larger than 1.65 
which is the score threshold under confidence level 0.9. 
Due to the cut-off threshold at 4, some strong TP only 
shows a bar in the box plot, which means that over 75% 
of the scores reach the upper limit. As seen from 




Figure 5 Bootstrapping confidence scores of identified TFs: (A) 

up-regulated TFs, and (B) down-regulated TFs. 



Pigure 6, TPs identified on Loi et al. data [16] have very 
high and stable score distributions. Por Symmans et al. 
data [15], the variation is a little larger, but most of the 
scores vary in the high confident region. 

Conclusion 

In this paper, we have proposed a new method consisting 
of a clustering method (i.e., mAPC) and a sampling based 
method (i.e., GibbsOS) to tackle the problem of regula- 
tory network identification. mAPC is different from tradi- 
tional clustering methods in terms of constructing co- 
regulated gene modules by utilizing both microarray 
gene expression data and motif binding information. Pol- 
lowing mAPC, GibbsOS is applied to refine the module 
for target gene identification to solve the issue of false 
positive connections in motif binding data. 

The proposed method is tested by simulation data with 
different SNRs and PP ratios. Significant improvements 
have been observed in terms of both gene module identi- 
fication and target gene identification. To further test the 
method with real biomedical applications, two breast 
cancer patient datasets are used for the identification of 
regulatory networks related to recurrence of breast can- 
cer. As a result, a key set of regulatory networks has been 
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Figure 6 Bootstrapping score variation for up-regulated TFs: (A) Symmans ef ai. data [15], (B) Loi et al. data [16]and down-regulated TPs: 

(C) Symmans ef al. data [15], (D) Loi ef al. data [16]. 



reconstructed with active transcription factors and their 
target genes. Importantly, these regulatory networks are 
functionally enriched in the progression and recurrence 
of breast cancer, warranting further investigations to 
assess their functional roles by biological experiments. 

Methods 

mAPC-GibbsOS is an integrative method that focuses 
on identifying modularized gene regulatory network. We 
use the log-linear model proposed by Liao et al, [8] 

X = AS + r, (1) 

where X is an N x K matrix representing the measured 
gene expression data, A is the regulation strength matrix 
with a dimension of N x M, M x K matrix S specifies the 



TF activities, T represents the inevitable experimental 
noise, N is the number of genes, K is the number of 
experiments and M is the number of TFs. This model 
interprets the regulatory mechanism as a bipartite net- 
work and the expression of gene can be considered as a 
direct result from TF activity associated with related reg- 
ulation strength. Based on this model, we can divide the 
whole gene set into two distinct categories: (1) "fore- 
ground" genes that are truly regulated by TFs and (2) 
"background" genes that are not related with TFs. It can 
also be seen that only the foreground genes will hold the 
relationship between gene expression and TF activity, 
therefore, it is necessary to identify modularized structure 
on the foreground gene set rather than the whole 
gene set. 
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Motif-guided affinity propagation clustering (mAPC) 

Our first goal is to use clustering methods to group the 
genes regulated by common TFs. Recently one innova- 
tive clustering method called affinity propagation clus- 
tering (APC) [13] has been developed, which has shown 
promising results for clustering data points. APC itera- 
tively passes messages between each pair of nodes and 
generates several 'exemplars' that show the characteris- 
tics of their own clusters best. The messages passed 
include "responsibility" and "availability" and they 
together indicate how appropriate to choose a point as 
exemplar. The application of APC in clustering of gene 
expression data usually uses a Euclidean distance to 
measure similarity, 

s(i,j) = -||xi -Xj||^, (2) 

where s{i,j) is the similarity between gene i and j, Xi,Xj 
are the gene expression data of gene i and j. It implies 
that the larger Euclidean distance between gene j and ; 
is, the less similarity between them is. However, this 
cost function only considers clustering genes with simi- 
lar gene expression patterns, thus it can only cluster co- 
expressed genes rather than co-regulated genes. Since 
gene expression data are relatively noisy, this cost func- 
tion tends to make the clustering results not reliable. In 
order to solve this problem, we propose a new cost 
function to incorporate motif binding information. Sup- 
pose we have a binding strength matrix W with the 
same dimension of the regulation matrix A (N x M) 
showing the possible connections between N genes and 
M TFs. w(i, j) will be a value taking either 0 or 1 repre- 
senting the possible binding connection between gene 
/and TF j. Given a candidate TF j, the probability that 
gene k and t are simultaneously regulated by TF j is 
proportional to w(k, j) x w(t, j). Hence, considering all 
available M TFs, the co-regulation probability between 
gene k and t from motif binding data will be propor- 
tional to 

Sregik t) = U> {k,j) X W {t,j) . (3) 

Incorporating co-regulation information described in 
Equation (3) into the classical similarity measurement, 
we can formulate our new cost function for APC as: 

S (i, j) = - (1 - A.) I |xi - Xj 1 1^ + A. Sregiij), (4) 

where X is a trade-off parameter between 0 and 1 to 
adjust the contribution of gene expression data and 
that of motif binding information. If X is 1, the clus- 
ters generated by mAPC will totally depend on motif 
binding information. On the contrary, if X is 0, 
mAPC turns out to be the classical APC as applied 
to gene expression data alone. As gene expression 



data are noisy, the second term can lower the noise 
effect by the positive support from binding informa- 
tion. On the other hand, the false connections existed 
in matrix W will be penalized by large negative gene 
expression similarity measurement, because genes not 
co-regulated do not share similar gene expression 
patterns. In general, this type of balanced cost func- 
tion will provide us a better representation of gene 
modules in terms of both co-regulation and co- 
expression. 

Transcription factor identification 

After obtaining clusters of genes, we have already 
uncovered the modularized structure for genes, but we 
need to further associate clusters of genes with TFs, 
which can be accomplished by testing TF enrichment 
for every cluster. A hyper-geometric test is applied to 
each TF to tackle this problem with the following 
hypotheses: 

Ho : TFj is not enriched in cluster c; Hi : TF j is enriched in cluster c. 

Assume that there are Nc genes in cluster c and 
Nbgenes have connections with TF j. In the whole gene 
population, the total number of genes is N and that of 
TF j related genes is Ng. Then we generate a null distri- 
bution by randomly sampling all the N genes to form 
random clusters with size Nc many times. For each clus- 
ter, we count the number of genes regulated by TF j, 
denoted as n. We can then calculate p-value as the per- 
centage of the clusters which have a number n greater 
than Nb. In fact, this process can be modelled by a 
hyper-geometric distribution and the p-value can be cal- 
culated as follows: 

Based on the calculated p-value, we can determine the 

enrichment of TFs in different clusters with a pre- 
defined threshold (which can be adjusted according to 
various cases). 

Gibbs sampler based on outlier sum statistics (GibbsOS) 

GibbsOS is a method that exploits both gene expression 
data and motif binding information to identify fore- 
ground genes from a possible large pool of genes. 
Firstly, suppose we can find a foreground gene set 
0 = 02, ■ • ■ , ©m] as "seed" genes, where M seed genes 
are selected for each of M TFs. Then according to 
Equation (1), 

Ee = A&S, (6) 

where Eq is the gene expression of M seed genes, Aq 
refers to the binding matrix of the seed genes and S is 
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the corresponding TF activity. Solving the Equation (6) 
for S, 



S = (A0)-i£0 
Given another foreground gene y, 
Ey = AyS = AyaE0 =PEq, 



(7) 



(8) 



where Ey and Ay are the gene expression and binding 
matrix of gene y and P = Aya infers the relationship 
between gene y and the seed genes. If gene y is a fore- 
ground gene truly regulated by TF j, then /Sj should be 
not equal to 0, due to the similarity shared between 
gene y and seed gene of TF j. Therefore, the regulation 
relationship between gene y and TF j can be solved by 
linear regression hypothesis test on the parameter esti- 
mated, 

Ho:Pj = 0;Hi :Pj ^0. 

By iteratively conducting the test for gene y and all 
the TFs, the regulation mechanism of gene y can be 
determined. The estimation of P, denoted as p, can be 
obtained by least square estimation, 



P = (F^F) ^F^y = CFV, 



(9) 



where we replace Ey and E© by y and F for the conveni- 
ence of representation. The mean squared error (MSE) of 
this fitted model with K experiments and M TFs is 



MSE : 



(y-F^)^(y-Fp) fy.f^cF^y 



(10) 



degree of freedom 



K-M- 1 



where degree of freedom is K — M — 1. Then a signifi- 
cant test statistic of regression coefficients can be 
applied, 



VMSE-yq^' 



(11) 



where Cmm is the m-th diagonal element of the matrix 
C- This test statistic t will follow a Student t-distribution 
with a degree of freedom of K — M — 1 , and then we 
can obtain corresponding p-value to make decision on 
hypothesis with certain predefined confidence level. 

Although we have already demonstrated the method to 
identify foreground genes, we actually do not know the 
ground truth behind the data. It is impossible to accu- 
rately draw foregrounds as seed genes, but we are sure 
that there must be multiple true foreground genes in the 
pool, thus we can make those foreground genes support 
each other. This approach can be completed in an itera- 
tive way, assuming that the candidate genes for TF j can 
be divided into 4>jF and $jB (<tjF n <I)jB = 0 and 



OjF U iI>iB = <tj), two sets containing foreground genes 
and background genes respectively. To start the iteration, 
we randomly select one gene 6ji(l < i < Kj) from 
where Kj is the cardinality of gene set ^^j. Together with 
the candidate genes for other TFs, we can obtain a fore- 
ground gene list 0 = [0i, 82, . . . , 6j = Oji, . . . , 0m]. Then 
we apply the linear regression T-test mentioned in Equa- 
tion (11); totally Kj — 1 (0ji to 0jKj except 0ji) t statistics 
will be generated for TF j. 

In order to obtain a summarized score for choosing 
0j = 0ji, we use an outlier sums (OS) statistic and it can 
be defined as. 



OS 




ta 



-,K-M-1 



(12) 



where tjk is the t-test statistic of gene k under the con- 
dition that 0j = 0ji, and I( ) is an indicator function 
which outputs 1 ^ith a non-negative value and outputs 
0 otherwise, and —^k-m-i refers to the threshold of a 
confidence level w fth K — M — 1 degree of freedom. As 
interpreted from the above equation, if 0,1 is a good 
choice of foreground genes for TF j, OS will achieve a 
large value. On the contrary, if 0)1 is not supported by 
other genes, OS will have a very small value or even 0. 
Considering Equation (12) as a function of 0; given the 
choices of other TFs, we can then rewrite it as: 



OSe, =/(0j|0i,...,0j-i,0j,i,0M). 



(13) 



The form of OS statistic indicates the dependency of 
the decision of one TF on the choices of foreground 
genes for other TFs, but what we are interested in is the 
marginal function which is independent of the choices 
of other TFs. 

Gibbs sampler is a fairly good technique which can 
draw samples with respect to marginal distribution with 
only the conditional distribution. For our GibbsOS case, 
we can construct a conditional probability density func- 
tion as follows: 

p(ej|9i 9j_i,9i+i,eM) = :^/(ejiei ej_i,ej^i,eM), (14) 

where the function /(•) is the same function as Equa- 
tion (13) (i.e., the outlier sum statistic function) and Kq 
is a normalization constant which ensures the total 
probability equals one. Then after initializing a set of 
candidate genes [Q°, G", . . . , sample the 

genes as: 



(15) 



where t denotes the t—ih sampling iteration. At each 
iteration step, we sequentially sample one candidate 
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foreground gene for each TF once. When going through 
sufficient steps, we not only sample those candidate 
genes, but also estimate the marginal distributions. In 
our case, we can simply use the frequency of a gene 
emerged in the sampled sequence to approximate the 
empirical marginal distribution. Then the genes with 
higher frequency will be more probable to be fore- 
ground genes. 

Adjusted rand index for performance evaluation 

To evaluate the clustering performance, a partition eva- 
luation method, namely adjusted rand index (ARI) [14], 
is used in this study. We will compare the results with 
the ground truth of the simulation data to assess the 
clustering accuracy. Suppose the resulting N clusters are 
C = {ci, C2, . . . , cn} and the M ground truth clusters are 
G = {gj, g2' ■■■ ' SmI- Let "ij be the number of elements 
existed in both cluster Q and Sj, and ni.and n.j are the 
total numbers of genes in Q and Sj, respectively. Then 
the ARI can be calculated as follows: 



ARI : 



E 



(16) 



Note that ARI will take a value between -1 and 1 and 
a higher value represents that two clusters are of more 
similarity. If the ARI value is 1, it means that the two 
clusters in comparison are the same. 
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